1 Recap

To briefly recap, last week we

  • Covered more data manipulation (e.g. filter, summarise)
  • Learned to identify issues in data (e.g. country names) and sort those issues out
  • Learned how to join data sets by an ID column (or more than one)
  • Learned to summarise data (e.g. total deaths) and to do this for each group in a data set using group_by()
  • Learned to do basic data visualisations with ggplot() and identify some common plotting issues
  • Learned about geoms and how they can be used to make different sorts of plots
  • Learned how to plot grouped data on the same plot (using different colours, shapes, and line types) and on different plots (facets)

Which led nicely onto…

2 Homework

First of all there isn’t a right or a wrong way of doing last week’s homework. Actually, this is usually the case with R - there are better (easier to read code, less likely to make mistakes, faster) and worse (the opposite of all of those things) ways of doing things, but not necessarily a right and a wrong way. Your homework was about thinking outside of the box, being creative, investigating the data, and trying to come up with an elegant visualisation. Here is one option, working from the long format of the data you generated last week:

## make a new column in long_spp which is the number of threats recorded for each population
## this is the number of non-NA values in each row across the columns primary_threat, secondary_threat, and tertiary_threat: 
long_spp <- long_spp %>% 
              rowwise() %>%
              mutate(n.threats = length(which(!is.na(c(primary_threat, 
                                                       secondary_threat, 
                                                       tertiary_threat)))))
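As an aside, the same count can be computed without rowwise() using the vectorised rowSums() function, which is usually faster on large data. A minimal sketch on a toy data frame (the threat values here are made up for illustration):

```r
## toy data frame standing in for long_spp (hypothetical values)
df <- data.frame(primary_threat   = c("logging", NA, "pollution"),
                 secondary_threat = c("fire",    NA, NA),
                 tertiary_threat  = c(NA,        NA, NA))

## count the non-NA entries across the three threat columns, row by row
df$n.threats <- rowSums(!is.na(df[, c("primary_threat",
                                      "secondary_threat",
                                      "tertiary_threat")]))

df$n.threats  ## 2 0 1
```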

## load in the scales package for the pseudo-log transformation
library(scales)

##build the ggplot, setting the date to a date format
ggplot(long_spp, aes(x=as.Date(date), 
                     y=abundance, 
                     col=n.threats))+
  ##set the theme to minimal
  theme_minimal() +
  ##position the legend at the top on the left
  theme(legend.position = "top", 
        legend.justification="left") +
  ## change the automatic label for the legend to the following
  ## note it is done twice (once for fill and once for col) as 
  ## we change the colour of the lines and the fill of the smooth according to the number of threats
  labs(col="Number of threats", fill="Number of threats") +
  ## add in a line for each population, using the group aesthetic and the interaction()
  ## function to specify how the data are grouped and get 1 line per population 
  geom_line(aes(group = interaction(species, 
                                    population)), 
                alpha=0.2, 
                size=1) + 
  ## add in a smoothed line across all of the population time series for each number of threats
  ## this visualises the mean change across the populations
  geom_smooth(aes(group = n.threats, 
                  fill = n.threats), alpha=0.3, size=1) +
  ##change the colours to reflect the number of threats, using a gradient
  scale_color_gradient(low = "#745745", 
                       high = "#d14124") +
  scale_fill_gradient(low = "#745745", 
                      high = "#d14124") +
  ## transform the y axis using a pseudo-log transformation 
  ## (as we have some 0 counts and you can't log 0)
  scale_y_continuous(trans="pseudo_log") +
  ##change the x and y axis labels
  ylab("Population abundance") +
  xlab("Year")

So here we have thought a little outside the box and looked at how the number of stressors that a population is threatened by alters its trajectory through time - it seems that, on average, the more threats a population faces the faster it declines.

How you visualise your data will of course depend on exactly the questions you are interested in. Plotting as above gives the advantage of being able to directly compare between the different groups (0, 1, 2, 3 stressors) but at the expense of the species-level data.

A couple of other neat tricks are worth highlighting:

  1. plotting on log scales allows data which consist of both large and small values to be plotted on the same plot (and be discernible)
  2. plotting a smoothed version of the data (in this case a loess smoothing across data grouped by the number of stressors) can allow you to better visualise trends
  3. using a theme other than the base one in ggplot (which is a pretty horrible grey) makes a big difference
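On point 1, it is worth seeing why the homework plot used a pseudo-log rather than a plain log transformation: log is undefined at 0, whereas the pseudo-log (which scales implements as, essentially, a scaled inverse hyperbolic sine) handles zeros gracefully while behaving like a log for large values:

```r
## log is undefined (negative infinity) at 0...
log(0)    ## -Inf

## ...but the inverse hyperbolic sine underlying pseudo_log is fine there
asinh(0)  ## 0

## and for large values asinh behaves like log, offset by a constant (~log(2))
asinh(1000) - log(1000)  ## ~0.693
```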

3 Statistics in R

So far you have learned to read in data, manipulate that data, and then do basic visualisation of that data. This week we are going to take the obvious next step and consider how we run statistical analyses in R.

The first point to make here is that this is going to be a brief introduction to some useful statistical methods. R was originally developed as a statistical programming language (i.e. not primarily used for data visualisation and manipulation) and consequently any analysis you can think of you can probably do in R (including implementing neural networks, Bayesian MCMC analyses, social network analysis, etc.).

We are going to focus on a single suite of modelling methods in-depth, and analyse the species and threats data you plotted for homework.

3.1 GLM(M)s

3.1.1 What is a glm?

Generalized Linear Models (GLMs) are a critical and hugely useful technique for data analysis. They are Generalised Linear Models because they allow you to specify the error distribution of your model (e.g. gaussian, poisson, etc.) and the link function (e.g. identity, log, etc.). This makes them hugely flexible: you can fit them to count data, binary data, or data where the predictor variables are categorical or continuous. Critically, GLMs can be fit to data where the errors are normally or non-normally distributed. Linear regressions - which I am sure you are all familiar with - are a special case of the GLM, where the data have a gaussian (normal) distribution with an identity link (see below).

3.1.2 Why are we studying GLMs?

As we said above, GLMs are fabulously flexible - they can do ANOVA-style analyses with categorical predictors, linear regressions with continuous predictor variables, and mix both categorical and continuous predictors in multiple regression analyses. We will also consider the extension to GLMs where we can account for non-independence between samples (so-called mixed models). GLMs are complex beasts, but they are widely used to analyse biological data because such data are often nested and non-normal, so we are going to cover this topic in-depth here rather than spread ourselves thinly across many different statistical techniques. We will touch on some other statistics in upcoming lectures.


The following two sections are a quick refresher on error distributions and link functions. Feel free to skip these if you are comfortable with the topics, or come back to them for homework if you aren’t as familiar with the content.

3.1.3 Error distributions

An error distribution specifies the structure of the error in the model - i.e. the variance in the data not explained by the predictor variables. Take the example of a gaussian (normal) distribution - we expect the residuals of the model to be normally distributed around the fitted model:

Here we have a linear model (blue horizontal line in the Fitted model panel) where there is no effect of a change in the x value on the y values (i.e. the slope is ~0). The data are plotted in light blue, and the residual error (the difference between the predicted values - i.e. the blue horizontal line - and the observed values - i.e. the light blue dots) is shown by the grey lines. You can see that the residuals are normally distributed around that fitted blue line (see the histogram of residuals). Remember, we are trying to minimise the residual error, as less error = a model which fits the data better. So we are looking for a roughly normal distribution of errors in our histogram of the residuals (☑), with no clear pattern in our fitted vs residuals plot (☑).
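The situation described above can be reproduced in miniature with simulated data: fit a gaussian glm where x has no effect on y, and confirm the residuals are centred on zero and roughly normal (all the variable names and numbers here are illustrative):

```r
set.seed(42)

## simulate data with no effect of x on y, and gaussian noise around the mean
x <- 1:100
y <- rnorm(100, mean = 10, sd = 2)

## fit a gaussian glm (equivalent to a linear regression here)
mod <- glm(y ~ x, family = "gaussian")

## the residuals are centred on zero...
round(mean(resid(mod)), 10)  ## 0

## ...and roughly bell-shaped
hist(resid(mod))
```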

If we now fit a gaussian model to count data:

You will see that the residuals are far from normally distributed. This is because we are working with counts, which are (1) integer values (whilst the normal distribution produces any real value between negative and positive infinity), and (2) bounded at 0.

Error distributions allow us to deal with these non-normally distributed data. In the above example we would probably use a poisson distribution (specifically designed to cope with non-negative integer values).

For more information on the error families for GLMs have a look at the help - ?family.
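3.1.4 Link functions

A link function describes the relationship between the linear predictor (the right-hand side of the model formula) and the mean of the response. The gaussian family defaults to the identity link (the mean is modelled directly), whilst the poisson family defaults to the log link (the log of the mean is modelled, which keeps predicted counts positive). Every family object in R records its link, so you can check the defaults yourself:

```r
## each family object records its link function
gaussian()$link              ## "identity"
poisson()$link               ## "log"
binomial()$link              ## "logit"

## non-default links can be requested explicitly
gaussian(link = "log")$link  ## "log"
```

We will make use of a non-default link (a gaussian family with a log link) later in this section.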

3.1.5 Fitting a GLM

So, let’s think about fitting a GLM to data. We will work first of all on a single population time series from the species data you plotted for homework last week, and we are going to see whether there has been a significant increase, decrease, or no change in that population over time and, if there is a change, what the rate of that change is. We will go through the process of fitting a glm to the data, checking the fit alongside other potential models, and then think about plotting the outputs of the model.


Task

Ok, so first off let’s pull out the data. Load in the species data you used last week from the GitHub repository (using vroom and the url to the data), and then reshape your data into long format. However, we aren’t interested in any values which have missing data, so we want to drop NA values during the pivot. You will need to modify your code to do this.


First off, let’s take a quick look at this big data set to refresh our memory on the structure.


Task

Use the print() function to look at the data.


If we look at this print out we can see that the date column isn’t specified in the date format, so let’s set that to a date first.


Task

Set the date column to date format.


So, the important bits for us (remember we want to know if there has been a significant change in the population over time) are going to be the date, abundance, and species columns for the moment.

As we said, we are going to use a single time series, so let’s extract the time series - using filter() for Trichocolea tomentella from the species data we loaded in.


Task

Filter the data to make a new data frame called single_spp containing only data on Trichocolea tomentella


And, as we should before all data analysis, we should visualise the data before we do anything. We will do this using ggplot() and fit a loess smoothing line so we can visualise any trends in the data:

##make the plot
p1 <- ggplot(single_spp, aes(x=date, y=abundance)) + 
  geom_point() + 
  geom_line() + 
  theme_bw() +
  ylab("Abundance") +
  xlab("Year")
##add the loess smoothing:
p1 + geom_smooth(method="loess")

Ok, so I think it’s pretty clear that the population is declining! But we want to know how fast that population is changing - so we should fit a model to these data.

To run a glm() in R is easy - it’s a base function (so no packages need installing).

One thing to consider is the x-axis on the plot above - R will accept date as a predictor variable (make sure it is set as a date!) but it is often better to replace the date with a numeric vector as it is easier to interpret (see below in the Model summaries and outputs section) - we are going to do this with the code below:

## calculate a new column (`standardised_time`) which is the difference between the 
## starting date of the time series and each other date in weeks (see ?difftime)
## we will set this to a numeric vector
single_spp <- single_spp %>% 
                mutate(standardised_time = as.numeric(difftime(as.Date(date), 
                                                               min(as.Date(date)), 
                                                               units = "weeks")))

print(single_spp[,c("abundance", "date", "standardised_time")], n = 30)
## # A tibble: 18 x 3
##    abundance date       standardised_time
##        <dbl> <date>                 <dbl>
##  1       245 2003-01-01             209. 
##  2       231 2004-01-01             261. 
##  3       125 2005-01-01             313. 
##  4       121 2006-01-01             365. 
##  5       125 2007-01-01             417. 
##  6        87 2008-01-01             470. 
##  7        79 2009-01-01             522. 
##  8        97 2010-01-01             574  
##  9        42 2011-01-01             626. 
## 10        35 2012-01-01             678. 
## 11        41 2013-01-01             731. 
## 12        49 2014-01-01             783. 
## 13        55 2015-01-01             835. 
## 14        40 2016-01-01             887  
## 15       336 2000-01-01              52.1
## 16       212 2001-01-01             104. 
## 17       204 2002-01-01             157. 
## 18       368 1999-01-01               0
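As a quick sanity check on the difftime() arithmetic above: one (non-leap) year is 365/7 ≈ 52.1 weeks, which matches the printed standardised times.

```r
## difference between two dates in weeks (365 days / 7)
as.numeric(difftime(as.Date("2000-01-01"),
                    as.Date("1999-01-01"),
                    units = "weeks"))
## 52.14286
```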

Now to fit our glm:

##fit a glm()
mod1 <- glm(abundance ~ standardised_time, 
            data = single_spp, 
            family = "gaussian")

Here we specify the formula of the model using ‘~’, where the first term is the dependent (y) variable, and the following term/terms is/are the independent variables. So “abundance” is our dependent variable, and standardised_time is our independent variable.

Hint - you can read ~ in R as “as a function of”. So in this case, we are testing whether the abundance of the species is a function of standardised time.

We have also specified the “family” to be “gaussian” (actually we didn’t need to do this as it is the default setting, but it’s good practice to make sure we understand what exactly we are doing). Family here means the error structure, so we are assuming (initially) that our residuals are normally distributed. This means that - given our current settings - we are actually just fitting a linear regression to the data.
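We can verify that claim using R’s built-in cars data (used here purely as a stand-in for our species data): a gaussian glm with an identity link returns exactly the same coefficients as lm():

```r
## fit the same model both ways on the built-in cars data
mod_glm <- glm(dist ~ speed, data = cars, family = "gaussian")
mod_lm  <- lm(dist ~ speed, data = cars)

## the estimated coefficients are identical
all.equal(coef(mod_glm), coef(mod_lm))  ## TRUE
```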

3.1.6 Assessing the fit of our model

Statistics is as much art as science, and assessing how well a model fits data is complex and time consuming. There isn’t an automated or best approach for doing it.

Remember:

As an initial pass we can produce plots like those I showed you above to visualise the fitted model against standardised_time and inspect the residuals, along with a few other tools for assessing how well the model fits.

To do this we need to extract some values from the model, specifically the predicted y values for each x value and the residuals. We can do this using the predict() and resid() functions:

##return the predicted (response) values from the model and add them to the single species tibble:
single_spp$pred_gaussian <- predict(mod1, 
                                    type="response")

##return the model residuals and add to the single species tibble:
single_spp$resid_gaussian <- resid(mod1)

Now we’ve added these data to our original data frame we can start to plot some of the plots in the style of the Error distributions section above.

First off let’s plot the data again, and add in the predicted values from the model as a line:

## plot the abundances through time
p2 <- ggplot(single_spp, aes(x = standardised_time, 
                             y = abundance)) + 
  geom_point() + 
  geom_line() + 
  theme_bw() +
  ylab("Abundance") +
  xlab("standardised_time")

##add in a line of the predicted values from the model
p2 <- p2 + geom_line(aes(x = standardised_time, 
                         y = pred_gaussian), 
                     col = "dodgerblue",
                     size = 1)

## we can also add in vertical blue lines which show the residual error of the model 
## (how far the observed points are from the predicted values). 
## in geom_segment we specify where we want the start and end of the segments (lines)
## to be. Without any prompting ggplot assumes that we want the start of the lines to
## be taken from the x and y values we are plotting using the ggplot() function 
## i.e. standardised_time and abundance, so we just need to specify the end points of
## the lines:
p2 <- p2 + 
  geom_segment(aes(xend = standardised_time, 
                   yend = pred_gaussian), 
               col="lightblue") 

## add a title
p2 <- p2 + ggtitle("Fitted model (gaussian with identity link)")

##print the plot
p2

Eye-balling this it looks like a pretty good fit, which is nice.

Next, let’s check the residuals around the fitted values:

##plot a histogram of the residuals from the model using geom_histogram()
p3 <- ggplot(single_spp, aes(x = resid_gaussian)) +
  geom_histogram(fill=I("goldenrod")) + 
  theme_minimal() + 
  ggtitle("Histogram of residuals (gaussian with identity link)")
## print the plot
p3 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The residuals aren’t what you would describe as “normally distributed” which is a concern.

Let’s look at how the residuals change with the predicted values.


Task

Try to make the plot below using your own code, plotting the predicted vs residuals in the single_spp data.frame


##make the plot of predicted vs residuals
p4 <- ggplot(single_spp, 
             aes(x = pred_gaussian, 
                 y = resid_gaussian)) + 
  geom_point() + 
  theme_minimal() + 
  xlab("Predicted values") + 
  ylab("residuals") + 
  ggtitle("Predicted vs residual (gaussian with identity link)") + 
  ##using geom_smooth without specifying the method (see later) means geom_smooth()
  ##will try a smoothing function with a formula y~x and will try to use a loess smoothing
  ##or a GAM (generalised additive model) smoothing depending on the number of data points
  geom_smooth(fill="lightblue", col="dodgerblue")

##print it
p4

Your plot definitely shows a banana shape, rather than the flat pattern we are hoping for. Actually we could probably have predicted this - if we look at the first plot we can see that at standardised_time ~ 0 and standardised_time > 750 the line consistently underpredicts the observed values, and hence we have higher residual errors - reflected in the banana-shaped curve you see above.

Let’s look at one last diagnostic plot, the Q-Q plot. If you aren’t familiar with QQ plots then find some information here. R has a helpful function to make these plots automatically:

##plot the qq plot for the residuals from the model assuming a normal distribution, 
## and add the straight line the points should fall along:
qqnorm(single_spp$resid_gaussian); qqline(single_spp$resid_gaussian)

We can see that these points also deviate from the expected straight line. QQ plots can give us quite a lot of information about the discrepancies in the data when compared to the theoretical distribution (in this case the gaussian) we have assumed describes the residuals in the model.


Task

Have a look at this application. Explore how the different skewness and tailedness in the data affect the QQ plots, and then decide what kind of skewness/tailedness we might have in our data plotted above.








Using this application it seems our data is likely to be negatively skewed. Let’s keep that in mind for later.

So what are our conclusions from this preliminary investigation? Well, we can see that the model does an OK but not great job of fitting the data - we need to investigate some other possible models to see if we can identify one which does a better job.

3.1.7 Exploring alternative models

So the next thing to do is fit some alternative models to the data to see if they do a better job.

The data are counts (abundance of species at each standardised_time) so one very sensible option is a glm with poisson error distribution.

## fit a glm with a poisson distribution
mod2 <- glm(abundance ~ standardised_time, 
            data = single_spp, 
            family = "poisson")

Another option, given the skewness in the data we explored above, would be a gaussian glm with a log link, which might help solve our skewness issue…

## fit a glm with a gaussian distribution with a log link
mod3 <- glm(abundance ~ standardised_time, 
            data = single_spp, 
            family = gaussian(link = "log"))

We could also try a gaussian model with an inverse link:

## we could also try a gaussian model with an inverse link
mod4 <- glm(abundance ~ standardised_time, 
            data = single_spp, 
            family = gaussian(link = "inverse"))

Ok, so now we have four potential models (mod1, mod2, mod3, mod4). We can compare the fits of these models to the data using the Akaike information criterion (AIC):

##compare the models 
AIC_mods <- AIC(mod1, 
                mod2, 
                mod3, 
                mod4)

## rank them by AIC using the order() function
AIC_mods[order(AIC_mods$AIC),]
##      df      AIC
## mod3  3 175.4301
## mod4  3 183.4976
## mod1  3 191.2363
## mod2  2 210.4472

Task

Try to work out how the order() function works - and why we use it in the square brackets for AIC_mods. To do this try running chunks of the code above in isolation.


Great. So it looks like the gaussian model with a log link (mod3) fits the data the best (has the lowest AIC). Conveniently, none of the other models is within 2 AIC of the best-fitting model, so we can concentrate on mod3 moving forward.

Whilst AIC tells us about the comparative fits of the different models, it doesn’t tell us how well the models actually fit the data. They could all just fit it really badly, and mod3 just fits it least badly! So we need to go back and check the fit of the model again.

##return the predicted (response) values from the model and add them to the single species tibble:
single_spp$pred_gaussian_log <- predict(mod3, 
                                    type="response")

##return the model residuals and add to the single species tibble:
single_spp$resid_gaussian_log <- resid(mod3)

##first off let's plot the data again, and add in the predicted values from the model as a line. We can modify the plot we started earlier:
p5 <- ggplot(single_spp, aes(x=standardised_time, y=abundance)) + 
  geom_point() + 
  geom_line() + 
  theme_bw() +
  ylab("Abundance") +
  xlab("standardised_time")

##add in a line of the predicted values from the model
p5 <- p5 + geom_line(aes(x = standardised_time, 
                         y = pred_gaussian_log), 
                     col = "dodgerblue",
                     size = 1)

## we can also add in lines showing the distance of each observation from 
## the value predicted by the model (i.e. these lines visualise the residual error)
p5 <- p5 + geom_segment(aes(xend = standardised_time, 
                            yend = pred_gaussian_log), 
                            col="lightblue") 

## add a title
p5 <- p5 + ggtitle("Fitted model (gaussian with log link)")

##print the plot
p5

That looks like a good fit! But we need to check the other diagnostics we plotted earlier. There is actually a convenient short-hand way of doing this in R - we can tell it to plot() a glm() object and it will automatically produce the following:

##plot the diagnostic graphics for model 3
plot(mod3)

Briefly, for the above plots:

  1. the residuals vs fitted plot doesn’t show a clear trend, so that’s ok
  2. the normal Q-Q plot closely follows the dotted line, so that’s ok
  3. the residuals appear randomly distributed, so there isn’t any heteroskedasticity - this is ok too
  4. none of the points are outside the 0.5 or 1 lines, so none of the data are exerting an overly strong effect on the predicted trends (if they were we might consider removing them).

In all, we have a model which fits the data well and which we can be pretty confident in.


Note - For a more in-depth discussion of how to interpret the above diagnostic plots see the excellent blog post here.


3.1.7.1 Fitting a gaussian model to count data

One thing we should bear in mind is that the gaussian distribution is not specifically formulated for count data - it assumes that the errors are continuous, whereas we know that isn’t the case with our data. This can be a cause for concern, and whilst strictly we shouldn’t do this (as it violates the assumptions of the normal distribution), we can usually get away with it when the mean of the dependent variable is high (as a rule of thumb > 50, which is the case here).
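This rule of thumb works because a poisson distribution with a large mean is approximately normal (its theoretical skewness is 1/sqrt(lambda)). A quick simulation, with illustrative numbers, shows the difference between high- and low-mean counts:

```r
set.seed(1)

## draw counts with a high mean (~100) and a low mean (~2)
high <- rpois(10000, lambda = 100)
low  <- rpois(10000, lambda = 2)

## a simple sample skewness function
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew(high)  ## close to 0.1 - nearly symmetric, so a gaussian is a fair approximation
skew(low)   ## close to 0.7 - clearly skewed, so a gaussian would be a poor fit
```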

3.1.8 Model summaries and outputs

So we have fit our model, and ensured that our model fits the data well. Now we want to look at what our model says about the data.

First off let’s look at the model output, using the summary() function:

##summarise the model outputs
summary(mod3)
## 
## Call:
## glm(formula = abundance ~ standardised_time, family = gaussian(link = "log"), 
##     data = single_spp)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -58.133  -18.337    0.437   16.949   55.486  
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.8867705  0.0534321   110.2  < 2e-16 ***
## standardised_time -0.0027565  0.0002377   -11.6 3.38e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 806.5102)
## 
##     Null deviance: 184068  on 17  degrees of freedom
## Residual deviance:  12904  on 16  degrees of freedom
## AIC: 175.43
## 
## Number of Fisher Scoring iterations: 4

You can see this gives us further details on the model. The important bits for us at this stage are the coefficients.

Note - for a full discussion of the output see the excellent posts here and here.

We can see there is a significant negative effect of standardised_time on the dependent variable (abundance). It’s now easy to see why we converted the date to standardised_time: with a continuous numeric predictor (number of weeks since the first recorded abundance) the output is easier to interpret. Note that because we used a log link the coefficient is on the log scale, so it describes a proportional change: abundance declines by a factor of exp(-0.0028) ≈ 0.997 per week, i.e. roughly 0.3% per week over the time series.
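Because the model uses a log link, exponentiating the slope (the coefficient value below is taken from the summary output above) gives the multiplicative change in abundance:

```r
## the slope from summary(mod3) is on the log (link) scale
slope <- -0.0027565

## weekly multiplicative change in abundance
exp(slope)          ## ~0.9972, i.e. a ~0.28% decline per week

## compounded over a year (365/7 ~ 52.1 weeks)
exp(slope * 365/7)  ## ~0.866, i.e. a ~13% decline per year
```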

We can plot the output of this model:

## first off let's plot the data again
p6 <- ggplot(single_spp, aes(x=standardised_time, y=abundance)) + 
  geom_point() + 
  geom_line() + 
  theme_bw() +
  ylab("Abundance") +
  xlab("standardised_time")

## use the geom_smooth() function to add the regression to the plot.
## unlike earlier here we are specifying the model type (glm), the formula, 
## and the error structure and link
p6 <- p6 + geom_smooth(data=single_spp,
                       method="glm", 
                       method.args = list(family = gaussian(link="log")), 
                       formula = y ~ x, 
                       col = "dodgerblue",
                       fill = "lightblue")

##print the plot
p6

You can see that ggplot() conveniently calculates the confidence intervals around the line, giving us a nice visualisation of the fitted model.

3.2 Mixed effects models - GLMMs

Generalized linear mixed-effects models (GLMMs) allow us to model data which are non-normally distributed and also include random effect structures within the data. These random effects could be differences between sites where we have collected data, data collected from different populations of the same species, etc., and they capture additional structure in the data which is not explained by our fixed effects. There are two kinds of random effect - random slopes and random intercepts. In most cases we specify a random intercept - i.e. we assume that the groups differ in where the regression line fit through the data crosses the y axis, but not in the rate of change of that regression line (the slope).
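In R, random effects are usually written with the (… | group) formula syntax used by the lme4 package. A minimal sketch of the two kinds of random effect, using our column names (note: this is an assumption about tooling - the formulas below follow lme4 conventions, and fitting them would require lme4 to be installed):

```r
## random intercept: each species gets its own intercept, but a shared slope
f_intercept <- abundance ~ standardised_time + (1 | species)

## random slope (and intercept): each species also gets its own slope over time
f_slope <- abundance ~ standardised_time + (standardised_time | species)

## with lme4 installed, these would be fitted with e.g.:
## lme4::lmer(f_intercept, data = long_spp)
```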

If you aren’t comfortable with the concepts of mixed effect models then there is a fantastic visual explanation of the effects of data grouping available here which I recommend going over either before you continue or after this class - it makes GLMMs very intuitive.

GLMMs are a pain to fit well, and complex to investigate/assess. A very good place for advice is Ben Bolker’s excellent post on frequently asked questions for GLMMs available here. There is another complementary post where he works through a series of GLMM examples available here.

3.2.1 Visualise the data

We will do a brief investigation of GLMMs in R (again, this is a quick intro rather than an in-depth dive - there is far too much to learn in 3 hours, but this will give you the skills to fit GLMMs and learn how to learn statistics in R). For this let’s use the full species and stressors data set that we filtered down from for the GLM section.

Again, we will add in a standardised time column to our data, starting at 0 weeks for the earliest recorded abundance for each population of each species:

## calculate standardised time per population and per species. 
long_spp <-  long_spp %>% 
                group_by(species, population) %>%
                mutate(standardised_time = as.numeric(difftime(as.Date(date), 
                                                               min(as.Date(date)), 
                                                               units="weeks")))

## just look at four of the columns to visualise the data 
## (you can look at all of them in Rstudio), but printing all the columns in 
## Rmarkdown means we won't see the standardised_time column
print(long_spp[,c("species", "population", "abundance", "standardised_time")], n = 10)
## # A tibble: 696 x 4
## # Groups:   species, population [59]
##    species                    population abundance standardised_time
##    <chr>                      <chr>          <dbl>             <dbl>
##  1 Schistidium helveticum     pop_1            353               0  
##  2 Schistidium helveticum     pop_1            419              52.1
##  3 Schistidium helveticum     pop_1            359             104. 
##  4 Schistidium helveticum     pop_1            363             156. 
##  5 Schistidium helveticum     pop_1            321             209. 
##  6 Schistidium helveticum     pop_1            280             261. 
##  7 Schistidium helveticum     pop_1            379             313  
##  8 Paraleucobryum longifolium pop_1             24               0  
##  9 Paraleucobryum longifolium pop_1             12              52.1
## 10 Paraleucobryum longifolium pop_1             18             104. 
## # … with 686 more rows

Task

There are a number of ways we could create the standardised time column (e.g. with week 0 set to the earliest recorded abundance of any species and population in the data, or, as we have done above, giving each species’ population its own week 0). Each of these has different implications and makes different assumptions about what patterns we are looking for in the data. Try to identify what the implications of choosing these different standardised times would be.


In the introduction to this week’s workbook I gave a possible solution for last week’s homework:

We will set out to see - statistically - if there is any change in the abundances of the different species over time based on the number of stressors (as in the plot above).


Task

Implement the code (given at the beginning) to calculate the number of stressors for each population of each species.


Great, so we now have a data frame with standardised times and the number of stressors added to it, let’s first visualise the data:

## we will specify the x and y aesthetics here but specify the groups for the lines later
## this is because doing so means that the data for the geom_smooth() will then be all of the 
## data for each of the facets, but we can specify grouped data for the lines in geom_line
ggplot(long_spp, aes(x = standardised_time,
                     y = abundance)) + 
  ##add the lines. we can use the group function to specify
  ## groups within the data (in this case species and population)
  ## that we want to treat as separate data for the sake of plotting
  ## the `interaction()` function allows us to specify multiple levels to these data
  geom_line(aes(group = interaction(species, population))) + 
  facet_wrap(~n.threats) +
  theme_minimal() +
  ##fit a linear regression to the data to help visualise
  geom_smooth(method = "lm")

## I haven't run this, but here is an example of what happens when we group 
## in the initial ggplot() function rather than in the geom_line()
## try running it and seeing what happens
ggplot(long_spp, aes(x = standardised_time,
                     y = abundance,
                     group = interaction(species, population))) + 
  geom_line() + 
  facet_wrap(~n.threats) +
  theme_minimal() +
  geom_smooth(method = "lm")

So, eyeballing it, it looks like there might be some differences between the treatments (number of threats) - although if you run the second chunk of code above, where a linear model is fit to each time series, you will see that there is some variation in those rates of change through time.

Ok, let’s think about the modelling framework. We know we have the following variables in the data which we might want to consider:

  1. abundance - this will be our response variable
  2. standardised_time - we want to understand whether the number of threats affects the populations through time, so we still need to include the time element
  3. n.threats - the number of threats
  4. species - looking at the above plots it seems like the species are potentially behaving in slightly different ways
  5. population - for some of the species we have multiple populations, which we might want to consider separately.

Great, so from this we can already start to think about a model structure based on what we are interested in. We know it's going to be something like:

abundance ~ standardised_time + n.threats

But remember we want to see whether the number of threats affects how population abundance changes through time, so we are going to need to consider an interaction between standardised_time and n.threats. We specify an interaction in R using a colon (:):

abundance ~ standardised_time + n.threats + standardised_time:n.threats
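As an aside, R also provides the `*` shorthand, which expands to the main effects plus their interaction, so the formula above can be written more compactly:

```r
## `a * b` expands to `a + b + a:b`, so this is equivalent to the formula above:
abundance ~ standardised_time * n.threats
```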

Great. Now let’s consider our other variables - species and population. We could consider species (and indeed population) as fixed effects in the model - if we were interested in them as part of the main part of our question. However, what we really want to know is the effect of the number of threats on abundance; we aren’t interested in the species effects per se.

This is where mixed models come to the fore - we want to look at the effects of standardised_time and n.threats whilst accounting for other variables (species & population) which might explain some variation in the data but whose effects we aren’t interested in.

There are a number of packages for fitting GLMMs in R - we are going to use the glmmTMB package because it's really fast (so great for big data sets) and has good online help/tutorials.


Task

Find, install, and load the glmmTMB package.


3.2.2 Coding mixed effects models

There are some differences between how GLMMs are coded between the different packages, but generally:

  • you specify random intercept effects using + (1 | group)
  • you specify random slope but fixed intercept using + (0 + x | group)
  • you specify random slope and intercept effects using + (x | group)

We are going to fit a random intercept model with intercepts for each of the species in our data using glmmTMB:

##fit a random effects model
m_mod1 <- glmmTMB(abundance ~ 
                    standardised_time + 
                    n.threats + 
                    standardised_time:n.threats + 
                    (1|species), 
                  data=long_spp, 
                  family="gaussian")

##look at the model summary:
summary(m_mod1)
##  Family: gaussian  ( identity )
## Formula:          
## abundance ~ standardised_time + n.threats + standardised_time:n.threats +  
##     (1 | species)
## Data: long_spp
## 
##      AIC      BIC   logLik deviance df.resid 
##   8002.8   8030.1  -3995.4   7990.8      690 
## 
## Random effects:
## 
## Conditional model:
##  Groups   Name        Variance Std.Dev.
##  species  (Intercept) 9602     97.99   
##  Residual             4447     66.69   
## Number of obs: 696, groups:  species, 51
## 
## Dispersion estimate for gaussian family (sigma^2): 4.45e+03 
## 
## Conditional model:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 255.547671  16.536993  15.453   <2e-16 ***
## standardised_time             0.038182   0.018306   2.086    0.037 *  
## n.threats                   -69.127191   5.394216 -12.815   <2e-16 ***
## standardised_time:n.threats  -0.091489   0.009579  -9.551   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So if we look at this output we can see that the amount of the variance the random effect (species) accounts for in our model is given, as is the residual variance. Remember that we are assuming that species (the random effect) accounts for all the variance in the data not explained by our fixed effects (standardised_time and n.threats and their interaction).

We can calculate how much of the variance the random effect accounts for as the variance of the random effect divided by the sum of the random effect variance and residual variance:


Task

Calculate the variance explained by the random effect


9602 / (9602+4447)
## [1] 0.683465

So about 68%.
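Rather than typing the numbers in by hand, you can pull the variance components out of the fitted model. This is a hedged sketch assuming the lme4-style `VarCorr()` structure that glmmTMB mimics; `sigma()` returns the residual standard deviation:

```r
## extract the variance components from m_mod1 (assumes the model fit above)
vc <- VarCorr(m_mod1)$cond
var_species <- attr(vc$species, "stddev")^2   ## random-effect variance
var_resid   <- sigma(m_mod1)^2                ## residual variance
var_species / (var_species + var_resid)
```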

However, whilst we expect there to be differences in the intercept between the species (as coded before) there is also a good reason to think that there might be differences between populations of the same species as well. We call this a nested random effect.

3.2.3 Nested random effects

Nested random effects allow us to specify that some random effects (in our case population) should be considered within a larger random effect (species) rather than separately in the model. In our case we don’t want to consider species and population separately because:

  1. we expect that there are larger differences between the different species than between the populations
  2. the differences between populations should be considered within the umbrella effects of the species, as they are all individuals of the same species and so we expect that the populations will behave in similar ways
  3. not all of the species have multiple populations

It can be hard to get your head around this, but consider, say, a data set containing the test results of individuals in classes in different schools. We want to know if there is an effect of gender on the test results whilst accounting for the potential random effects of school and class. Clearly, the random effect of class should be nested in school (as the school determines the environment, practices, approaches, etc. which happen in each class), but class identity might still have an effect as each class is taught by a different teacher.
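In formula terms, the nesting shorthand simply expands to two random-intercept terms: one for the outer group and one for the inner group within the outer. A hypothetical sketch of the schools example (the data frame `pupils` and its columns are made up for illustration):

```r
## (1 | school/class) is equivalent to (1 | school) + (1 | school:class)
## hypothetical schools example:
# m <- glmmTMB(result ~ gender + (1 | school/class), data = pupils)
```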

So back to our example, we want to nest population inside species, and do this as follows:

##fit a random effects model with nested random effects
m_mod2 <- glmmTMB(abundance ~ standardised_time + n.threats + standardised_time:n.threats + (1|species/population), 
                  data=long_spp, 
                  family="gaussian")

##look at the model summary:
summary(m_mod2)
##  Family: gaussian  ( identity )
## Formula:          
## abundance ~ standardised_time + n.threats + standardised_time:n.threats +  
##     (1 | species/population)
## Data: long_spp
## 
##      AIC      BIC   logLik deviance df.resid 
##   7911.9   7943.7  -3949.0   7897.9      689 
## 
## Random effects:
## 
## Conditional model:
##  Groups             Name        Variance  Std.Dev. 
##  population:species (Intercept) 8.988e+03 9.481e+01
##  species            (Intercept) 5.554e-93 7.452e-47
##  Residual                       3.755e+03 6.128e+01
## Number of obs: 696, groups:  population:species, 59; species, 51
## 
## Dispersion estimate for gaussian family (sigma^2): 3.76e+03 
## 
## Conditional model:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 208.55777   18.16210  11.483  < 2e-16 ***
## standardised_time             0.07329    0.01797   4.079 4.53e-05 ***
## n.threats                   -33.47467    8.30603  -4.030 5.57e-05 ***
## standardised_time:n.threats  -0.11393    0.00934 -12.198  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now we can see that we have a new term (population:species) in our “conditional model” and again we can check how much variance is explained by these random effects.


Task

Calculate the variance explained by these random effects


(8.988e+03+5.554e-93)/(8.988e+03+5.554e-93+3.755e+03)
## [1] 0.7053284

Not much more than just species by itself, HOWEVER we know a priori that population should nest within species (because we expect individuals of the same species to have e.g. similar reproductive rates, behaviours, mortality rates etc) and so we should include this in the model, regardless of whether it explains more of the variance (although we would expect it would).

3.2.4 Crossed random effects

We can also specify crossed random effects, where two or more variables can create distinct groupings:

y ~ x1 + x2 + (1|z1) + (1|z2)

Whether a random effect is nested or crossed can depend upon what scale you are looking at, so it isn’t necessarily trivial to decide whether to nest or cross your random effects (and a full discussion is beyond today’s work) but there are some resources to help you get your head around it at the bottom of this workbook.
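To make the contrast concrete, here is a hypothetical crossed design: every pupil answers every test item, so pupil and item cut across each other rather than nest (`answers`, `pupil`, and `item` are made-up names for illustration):

```r
## crossed random effects: pupil and item are independent grouping factors
# m <- glmmTMB(score ~ gender + (1 | pupil) + (1 | item), data = answers)
```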

3.2.5 Assessing GLMM model fits

We will stick with our nested model design as there isn't any variable in the data which would suggest that we should include a crossed design:

abundance ~ standardised_time + n.threats + standardised_time:n.threats + (1|species/population)

So far we have considered what random effects should be in the model, but haven’t actually looked at the model fit! The basic approach of plotting a QQ plot as we did above doesn’t work well mathematically for GLMMs; however, there are some handy tools in the DHARMa package which can help us assess the model fit more robustly. These work by using the model fit to simulate residuals which incorporate the uncertainty in the model - see ?simulateResiduals for a good explanation.

##load DHARMa
library(DHARMa)

## simulate the residuals from the model
##setting the number of sims to 1000 (more is better, according to the DHARMa help file)
m_mod2_sim <- simulateResiduals(m_mod2, n = 1000)

##plot out the residuals
plot(m_mod2_sim)

Ok so what do these plots tell us? Well on the left is a version of the QQ plot we saw earlier, and we read it in exactly the same way as a normal QQ plot. We also get some statistical tests to see how well our data fit the distribution - if we have a significant (p<0.05) test score then our data are significantly DIFFERENT from the distribution specified in the model - you can see here that this is the case.

On the right we have a version of the residuals vs predicted plot - we are looking for solid red lines which are close to horizontal (actually close to tracking the dashed red horizontal lines at 0.25, 0.5, and 0.75 - the quartiles of the data). Again, we can see our data are a poor fit and we get a warning about the quantile deviations in the title.

Conclusion? Our model is doing a poor job of fitting to the data.


Task

Head back to the online qqplot visualisations and play with the sliders and see if you can determine what sort of skew in the data we are dealing with.


So looking at the online model I would say we are looking at long tailed positively skewed data when compared to a normal distribution. Let’s see if we can solve this.

This is where understanding our data is important - why are we seeing these patterns? Well, one reason is because of this:

We have time series which go extinct, and then we have a series of 0’s recorded until the end of the time series (plotted as points in the graphic above).


Task

Consider whether we should include these 0’s in our analysis.








Well, let’s think about fitting a linear regression to the data above, where we either (1) include the zeros (red line in the plot below) or (2) don’t include them (yellow line):

You can see that - quite obviously - there is a big difference in our estimates of the rate of change of these populations over time and I hope that it should be obvious that we don’t want to include all of the 0’s in the analysis.

So do we just remove all of the 0s?

Well…no, because the first 0 (in chronological order) still gives us valid and important information on how fast the population is declining, so we should just exclude all the rest of the 0’s after that.

So how do we go about doing this?

Well, first we need to write a function to return the data we want (only one 0). Luckily our data finish with 0, 0, 0, 0 etc and don’t have something like 0, 0, 1, 0, 0 (which would make things trickier to code). This means we can tell R to return the data up to the first instance of a 0 being present:

##a function to return data up to and including the first 0 count:
single_zero <- function(x, group_info){
  if(nrow(x)>0){
    x <- x[ order(x$standardised_time) ,]
    if( min(x$abundance) == 0){
      return( x[1:min(which(x$abundance == 0)),] )
    }else{return(as_tibble(x))}
  }
}

Task

See if you can work out what the above function does. I suggest running it on a small subset of the full data, e.g. by using the [[2]] approach above to select out a data frame from the split_spp list. Hint - the group_info section in function(x, group_info) is necessary for the group_map() function to work, so you can ignore this.
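For example, a hedged sketch of that check (assuming `split_spp` is the list of per-group data frames created earlier in the workbook):

```r
## run the function on one group and inspect the tail of the abundances
one_pop <- split_spp[[2]]
trimmed <- single_zero(one_pop, group_info = NULL)
tail(trimmed$abundance)   ## should end in at most a single 0
```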


Ok, so now that we have a function (and you have checked it works using a subset of the data) we can apply this function to our list. The tidyverse provides tools to do that - namely the group_map() function:

## make a new data frame
species_single0 <- long_spp %>% 
                      ##we want to cut the long_spp data up by the values in species and population
                      group_by(species, population) %>% 
                      ##the group_map function allows us to apply a function to these groups
                      ##and we want to keep the grouping variables, hence keep = T
                      group_map(single_zero, keep = T) %>% 
                      ##and then bind it all back together again to a tibble
                      ##otherwise it is returned as a list of the group data
                      bind_rows()
## Warning: The `keep` argument of `group_map()` is deprecated as of dplyr 1.0.0.
## Please use the `.keep` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
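The warning is just a deprecation notice; on dplyr >= 1.0.0 the same pipeline uses the renamed `.keep` argument (as the warning itself suggests) and gives the same result:

```r
## identical pipeline with the non-deprecated argument name
species_single0 <- long_spp %>%
  group_by(species, population) %>%
  group_map(single_zero, .keep = TRUE) %>%
  bind_rows()
```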

Let’s check whether this has worked by looking at the data we plotted earlier

## # A tibble: 9 x 3
##   species            standardised_time abundance
##   <chr>                          <dbl>     <dbl>
## 1 Acaulon triquetrum               0         189
## 2 Acaulon triquetrum              52.3       114
## 3 Acaulon triquetrum             104.         88
## 4 Acaulon triquetrum             157.         36
## 5 Acaulon triquetrum             209.          3
## 6 Acaulon triquetrum             261           1
## 7 Acaulon triquetrum             313.          1
## 8 Acaulon triquetrum             365.          1
## 9 Acaulon triquetrum             417.          0

Great! It ends in a single 0. We can continue.
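Rather than checking a single group by eye, we can count the zeros in every group; a hedged sanity check (no group should contain more than one zero):

```r
## count zeros per species/population and look at the worst case
species_single0 %>%
  group_by(species, population) %>%
  summarise(n_zeros = sum(abundance == 0), .groups = "drop") %>%
  summarise(max_zeros = max(n_zeros))   ## should be 0 or 1
```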

So our model didn’t fit well earlier, but might fit better now we have removed the data that shouldn’t have been there in the first place! So, let’s do some exploration.

So, we could fit a GLMM with a log link (as this is what worked best in the first example on a single species), however because we have 0 values in the data (and you can’t log a zero) we need to add one to the data. Additionally, we are using count data, so really the gaussian distribution isn’t a good option (unless the mean of the data is high, as it was when we fit the model to the single time series, but isn't now we are dealing with values which decline down to 0).

So, what other options do we have? Well, count data are a very common kind of data to have, so luckily there is a raft of tools available for them. The following are some of the most commonly used count data distributions which can be fit in glmmTMB:

  • poisson
  • generalised poisson
  • negative binomial (there are two different parameterizations of this available)
    • nbinom1
    • nbinom2

There are also truncated versions of many of the above, but if you plot a histogram of our data then it doesn’t look like our data are truncated:
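If you want to reproduce that histogram yourself, a minimal sketch:

```r
## quick look at the distribution of the counts
hist(species_single0$abundance, breaks = 50,
     xlab = "abundance", main = "")
```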

This is a fairly baffling array of options, but there are some tools we can use to help us assess these data.

Note - all of the below distribution identification techniques work well (often better) for GLMs.

The fitdistrplus library allows us to compare our data to distributions fit to it:

##load the fitdistrplus package
library(fitdistrplus)    

##fit a poisson distribution to our data:
fit_pois <- fitdist(species_single0$abundance, 
                    distr = "pois")

This throws up some warnings. After some digging these are because we have some zero frequencies in our data (we can ignore this).

##plot the data
plot(fit_pois)

##look at the summary statistics
gofstat(fit_pois)
## Chi-squared statistic:  Inf 
## Degree of freedom of the Chi-squared distribution:  19 
## Chi-squared p-value:  0 
##    the p-value may be wrong with some theoretical counts < 5  
## Chi-squared table:
##           obscounts   theocounts
## <= 1   2.800000e+01 6.081900e-57
## <= 5   3.300000e+01 2.049747e-50
## <= 11  3.100000e+01 5.019744e-43
## <= 19  3.100000e+01 2.708704e-35
## <= 30  2.700000e+01 5.877842e-27
## <= 41  2.900000e+01 2.226201e-20
## <= 55  2.700000e+01 8.179528e-14
## <= 68  2.800000e+01 4.194862e-09
## <= 81  2.700000e+01 1.849605e-05
## <= 92  2.800000e+01 4.473910e-03
## <= 102 2.700000e+01 2.142684e-01
## <= 121 3.000000e+01 2.940196e+01
## <= 145 3.200000e+01 3.654994e+02
## <= 162 2.900000e+01 1.832680e+02
## <= 187 2.700000e+01 2.156040e+01
## <= 222 2.800000e+01 5.147652e-02
## <= 249 2.700000e+01 6.213563e-08
## <= 293 2.700000e+01 6.661338e-14
## <= 356 2.700000e+01 0.000000e+00
## <= 436 2.700000e+01 0.000000e+00
## > 436  3.000000e+01 0.000000e+00
## 
## Goodness-of-fit criteria
##                                1-mle-pois
## Akaike's Information Criterion   82929.96
## Bayesian Information Criterion   82934.35

In the first plot (on the left) the red density shows the EXPECTED distribution of the data if they were from a poisson distribution, whilst the actual data are shown by the thin black bars - what we are looking for is the bars following the same distribution as the red area. You can instantly see that they aren’t the same, and this is backed up by the right hand plot (the cumulative density function) where the black and red lines should mimic each other.

This is further backed up by the p-value from the Goodness-of-fit test for poisson distribution - a significant P value here denotes a significant difference from the distribution we are testing the data against (poisson).

Conclusion - the poisson distribution is a bad fit for our data.

What about the negative binomial distribution? This is a great (and more flexible) option for fitting to count data:

##fit a nbinom to the data instead
fit_nbinom <- fitdist(species_single0$abundance, 
                      distr = "nbinom")
##again we get warnings (from the zeros in the data) which we can ignore

##plot it out:
plot(fit_nbinom)

That looks a lot better! We are still underpredicting throughout the distribution, but the CDF looks good.

##the goodness of fit
gofstat(fit_nbinom)
## Chi-squared statistic:  31.67363 
## Degree of freedom of the Chi-squared distribution:  18 
## Chi-squared p-value:  0.02402514 
## Chi-squared table:
##        obscounts theocounts
## <= 1    28.00000   19.97953
## <= 5    33.00000   26.85885
## <= 11   31.00000   31.87077
## <= 19   31.00000   35.46763
## <= 30   27.00000   41.34348
## <= 41   29.00000   35.67087
## <= 55   27.00000   39.52625
## <= 68   28.00000   32.15955
## <= 81   27.00000   28.64267
## <= 92   28.00000   21.91913
## <= 102  27.00000   18.32195
## <= 121  30.00000   31.17071
## <= 145  32.00000   33.64150
## <= 162  29.00000   20.60556
## <= 187  27.00000   26.26787
## <= 222  28.00000   30.16536
## <= 249  27.00000   19.02378
## <= 293  27.00000   24.82173
## <= 356  27.00000   25.61594
## <= 436  27.00000   21.19383
## > 436   30.00000   35.73303
## 
## Goodness-of-fit criteria
##                                1-mle-nbinom
## Akaike's Information Criterion     7114.463
## Bayesian Information Criterion     7123.257

But we still have a significant difference from the nbinom distribution (p<0.05). However, remember that above we are looking at the distribution of the WHOLE data, and we need to ensure that the residuals fit the model assumptions. So the above is SUGGESTING that the poisson is unlikely to be a good fit, and that the negative binomial might be better. Better still, we might consider a zero inflated negative binomial - which can account for a larger number of 0s than expected given the negative binomial distribution. So we might like to consider the following distributions for our model:

  1. poisson (unlikely to fit well)
  2. negative binomial
  3. zero inflated negative binomial

Let’s start by fitting the poisson model as a starting point. We are expecting this to fit the data poorly, but we can try it anyway. We are going to use scaled predictor variables.

Why standardised data?

Just as a quick aside - we often scale() data (so the data have a mean of zero (“centering”) and a standard deviation of one (“scaling”)) when the continuous predictor variables we are using are on very different scales. In our case we have one (number of threats) which is between 0 and 3 and one (standardised time) which is between 0 and 991. This can cause issues with the model fitting, as the coefficients for one of your predictors might be vanishingly small (and therefore hard to estimate) when compared to your other predictors.
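scale() just centres and scales: subtract the mean, divide by the standard deviation. A minimal base-R illustration (the numbers are made up):

```r
x <- c(0, 250, 500, 750, 991)   ## hypothetical standardised_time values
z <- (x - mean(x)) / sd(x)      ## what scale(x) does, column by column
mean(z)                         ## ~0 (centred)
sd(z)                           ## 1 (scaled)
```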

So here we fit our model, with scaled parameters, and population nested within species as our random effects:

## fit a poisson model
ps_mod <- glmmTMB(abundance ~ scale(standardised_time) * scale(n.threats) + (1 | species/population), 
                  data = species_single0,
                  family="poisson")
##summarise the model output
summary(ps_mod)
##  Family: poisson  ( log )
## Formula:          
## abundance ~ scale(standardised_time) * scale(n.threats) + (1 |  
##     species/population)
## Data: species_single0
## 
##      AIC      BIC   logLik deviance df.resid 
##  13277.4  13303.8  -6632.7  13265.4      594 
## 
## Random effects:
## 
## Conditional model:
##  Groups             Name        Variance  Std.Dev. 
##  population:species (Intercept) 1.409e+00 1.1870115
##  species            (Intercept) 4.827e-07 0.0006948
## Number of obs: 600, groups:  population:species, 59; species, 51
## 
## Conditional model:
##                                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)                                4.158693   0.156990   26.49  < 2e-16
## scale(standardised_time)                  -0.547610   0.006039  -90.68  < 2e-16
## scale(n.threats)                          -0.995613   0.150798   -6.60 4.05e-11
## scale(standardised_time):scale(n.threats) -0.558804   0.005773  -96.80  < 2e-16
##                                              
## (Intercept)                               ***
## scale(standardised_time)                  ***
## scale(n.threats)                          ***
## scale(standardised_time):scale(n.threats) ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If we want to look at how well a linear regression fits the data we tend to use r-squared as a goodness of fit measure for our model (an r2 of 1 means the model exactly predicts the data). This measure falls apart when we consider GLMs, and even more so when we consider GLMMs, but we can calculate a simple pseudo r2 value to help us roughly estimate the goodness of fit. There are a number of simple and complex ways to do this, but we will just implement the following:

##function to calculate a pseudo R2 value for our GLMM
r2.corr.mer <- function(m) {
   lmfit <-  lm(model.response(model.frame(m)) ~ fitted(m))
   summary(lmfit)$r.squared
}

## apply it to our poisson model
r2.corr.mer(ps_mod)
## [1] 0.8672486

0.867 is a very good r2. There are also more complex methods, e.g. that available in the MuMIn package:

##r squared calculated by MuMIn:
MuMIn::r.squaredGLMM(ps_mod)
##                 R2m       R2c
## delta     0.5182369 0.9981549
## lognormal 0.5182394 0.9981599
## trigamma  0.5182343 0.9981499

Which is lower.

We can also use a simple function to look at our overdispersion in the model:

##function to calculate overdispersion 
overdisp_fun <- function(model) {
    rdf <- df.residual(model)
    rp <- residuals(model,type="pearson")
    Pearson.chisq <- sum(rp^2)
    prat <- Pearson.chisq/rdf
    pval <- pchisq(Pearson.chisq, df=rdf, lower.tail=FALSE)
    c(chisq=Pearson.chisq,ratio=prat,rdf=rdf,p=pval)
}

##apply to our data
overdisp_fun(ps_mod)
##      chisq      ratio        rdf          p 
## 9357.20287   15.75287  594.00000    0.00000

The data are considered overdispersed when the chisq value is > the rdf value - i.e. when the ratio is > 1. Clearly these data are overdispersed!

Note - overdispersion is an issue only for distributions which don't estimate a scale parameter (i.e. poisson and binomial).

3.2.6 Fitting alternative models

So let’s fit some other potential models to the data. We can update() the model structure so we don’t have to keep typing out the full model again:

###try with nbinom: 
nb1_mod <- update(ps_mod, family="nbinom1")

##and the second parameterisation of nb
nb2_mod <- update(ps_mod, family="nbinom2")

We should also consider zero inflated versions of the NB distributions, as we know that we have 0s in our data. glmmTMB allows us to assume that all of the data are equally likely to have zero inflation (using the ziformula = ~1 argument), or alternatively to specify structures in the data which might explain our increase in zeros. In our case, if we look at our data plotted out above, we can see that (and logically this makes sense) the number of threats is likely to be a key driver of the number of zeros, as the number of threats seems to determine whether species are declining or not. In this case we can specify the number of threats as the potential driver of zeros in our data:

##zero inflated version of these with zeros being attributed to number of threats:
Zi_nb1_mod <- update(nb1_mod, ziformula = ~n.threats)
Zi_nb2_mod <- update(nb2_mod, ziformula = ~n.threats)

##and we can also look at the zero inflated version of the poisson model:
Zi_ps_mod <- update(ps_mod, ziformula = ~n.threats)

Ok, so let’s compare the model fits of these three models. We can do this using the anova() (analysis of variance) function which gives us additional information on top of what AIC() returns to us:

##compare the models
anova(ps_mod, 
      Zi_ps_mod,
      nb1_mod, 
      nb2_mod,
      Zi_nb1_mod,
      Zi_nb2_mod)
## Data: species_single0
## Models:
## ps_mod: abundance ~ scale(standardised_time) * scale(n.threats) + (1 | , zi=~0, disp=~1
## ps_mod:     species/population), zi=~0, disp=~1
## nb1_mod: abundance ~ scale(standardised_time) * scale(n.threats) + (1 | , zi=~0, disp=~1
## nb1_mod:     species/population), zi=~n.threats, disp=~1
## nb2_mod: abundance ~ scale(standardised_time) * scale(n.threats) + (1 | , zi=~n.threats, disp=~1
## nb2_mod:     species/population), zi=~n.threats, disp=~1
## Zi_ps_mod: abundance ~ scale(standardised_time) * scale(n.threats) + (1 | , zi=~0, disp=~1
## Zi_ps_mod:     species/population), zi=~0, disp=~1
## Zi_nb1_mod: abundance ~ scale(standardised_time) * scale(n.threats) + (1 | , zi=~0, disp=~1
## Zi_nb1_mod:     species/population), zi=~n.threats, disp=~1
## Zi_nb2_mod: abundance ~ scale(standardised_time) * scale(n.threats) + (1 | , zi=~n.threats, disp=~1
## Zi_nb2_mod:     species/population), zi=~n.threats, disp=~1
##            Df     AIC     BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## ps_mod      6 13277.4 13303.8 -6632.7  13265.4                             
## nb1_mod     7  6113.1  6143.9 -3049.6   6099.1 7166.3      1     <2e-16 ***
## nb2_mod     7  6220.4  6251.1 -3103.2   6206.4    0.0      0          1    
## Zi_ps_mod   8 12477.4 12512.5 -6230.7  12461.4    0.0      1          1    
## Zi_nb1_mod  9  6041.6  6081.2 -3011.8   6023.6 6437.7      1     <2e-16 ***
## Zi_nb2_mod  9  6107.2  6146.8 -3044.6   6089.2    0.0      0          1    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
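If you just want the AIC column without the likelihood-ratio tests, AIC() accepts multiple fitted models (a sketch, assuming all six models above fitted without error):

```r
## compact AIC-only comparison of the same six models
AIC(ps_mod, Zi_ps_mod, nb1_mod, nb2_mod, Zi_nb1_mod, Zi_nb2_mod)
```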

We can see that the zero inflated NB1 model (Zi_nb1_mod) seems to fit the best of the models tested (it has the lowest AIC), and if we look at the pseudo-r2 value we can see that it is still fitting well:


Task

Use the r2.corr.mer() function to calculate the pseudo r2 value for the Zi_nb1_mod.


##pseudo r2 value
r2.corr.mer(Zi_nb1_mod)
## [1] 0.8599625

Sadly the MuMIn::r.squaredGLMM approach we tried earlier doesn’t (yet) handle zero inflated versions of the NB in glmmTMB; however, this model is looking like a pretty good fit for our data.

3.2.7 Temporal correlations in data

One other thing we might want to consider is correlation structures in our data. In this case we are fitting models to time series, and because the value of any given time point depends on the value of the previous time point, we might need to consider some underlying structure (autocorrelation) in the data we are analysing. We can implement this in glmmTMB:

##fit a zero inflated negative binomial model with autocorrelation structure defined by the time variable
Zi_nb1_ar1_mod <- glmmTMB(abundance ~ scale(standardised_time) * scale(n.threats) + (1|species/population) + ar1(factor(scale(standardised_time)) - 1|species/population), 
                  data = species_single0,
                  ziformula=~n.threats,
                  family="nbinom1")
## Warning in fitTMB(TMBStruc): Model convergence problem; non-positive-definite
## Hessian matrix. See vignette('troubleshooting')
## Warning in fitTMB(TMBStruc): Model convergence problem; false convergence (8).
## See vignette('troubleshooting')

Note - we have to include the predictor for ar1() as a factor (that's just the way that glmmTMB needs it to be, don't ask!), hence the factor(scale(standardised_time)).

Trying to fit this throws up a number of warnings/errors. If you run vignette('troubleshooting') you get some very useful help on these warnings. We can solve the main one (false convergence) fairly easily, by implementing a different maximum likelihood optimizer:

##fit a zero inflated negative binomial model with autocorrelation structure defined by the time variable
Zi_nb1_ar1_mod <- glmmTMB(abundance ~ scale(standardised_time) * scale(n.threats) + (1|species/population) + ar1(scale(standardised_time)-1|species/population), 
                  data = species_single0,
                  ziformula=~n.threats,
                  family="nbinom1",
                  ##the control parameter specifying the optimizer to use:
                  control=glmmTMBControl(optimizer=optim, optArgs=list(method="BFGS")))
## Warning in fitTMB(TMBStruc): Model convergence problem; non-positive-definite
## Hessian matrix. See vignette('troubleshooting')

which solves one of the issues, but not:

Model convergence problem; non-positive-definite Hessian matrix

The troubleshooting vignette gives the following as probable reasons for this warning:

  1. when a model is overparameterized (i.e. the data does not contain enough information to estimate the parameters reliably)
  2. when a random-effect variance is estimated to be zero, or random-effect terms are estimated to be perfectly correlated (“singular fit”: often caused by having too many levels of the random-effect grouping variable)
  3. when zero-inflation is estimated to be near zero (a strongly negative zero-inflation parameter)
  4. when dispersion is estimated to be near zero
  5. when complete separation occurs in a binomial model: some categories in the model contain proportions that are either all 0 or all 1

If we look at the fixed effects of the model:

##show the fixed effects
fixef(Zi_nb1_ar1_mod)
## 
## Conditional model:
##                               (Intercept)  
##                                    4.3060  
##                  scale(standardised_time)  
##                                   -0.5596  
##                          scale(n.threats)  
##                                   -0.8121  
## scale(standardised_time):scale(n.threats)  
##                                   -0.5799  
## 
## Zero-inflation model:
## (Intercept)    n.threats  
##     -13.892        3.581

You can see that the zero inflation parameters (number 3) are not near zero. We also don’t have a dispersion parameter in the model (4) and we aren’t fitting a binomial model (5). Which leaves us with (1) and (2). If we look at the random effect sizes from the zero inflated negative binomial model without autocorrelation:

##look at the model
Zi_nb1_mod
## Formula:          
## abundance ~ scale(standardised_time) * scale(n.threats) + (1 |  
##     species/population)
## Zero inflation:             ~n.threats
## Data: species_single0
##       AIC       BIC    logLik  df.resid 
##  6041.627  6081.199 -3011.813       591 
## Random-effects (co)variances:
## 
## Conditional model:
##  Groups             Name        Std.Dev.
##  population:species (Intercept) 0.707994
##  species            (Intercept) 0.000331
## 
## Number of obs: 600 / Conditional model: population:species, 59; species, 51
## 
## Overdispersion parameter for nbinom1 family (): 15.2 
## 
## Fixed Effects:
## 
## Conditional model:
##                               (Intercept)  
##                                    4.4345  
##                  scale(standardised_time)  
##                                   -0.4922  
##                          scale(n.threats)  
##                                   -0.7201  
## scale(standardised_time):scale(n.threats)  
##                                   -0.5032  
## 
## Zero-inflation model:
## (Intercept)    n.threats  
##      -8.471        1.810

We can see that the random effect of species is quite small, so we could consider removing this from the model with autocorrelation and trying it again:

##zero inflated negative binomial model with autocorrelation and population as random effects
Zi_nb1_ar1_mod <- glmmTMB(abundance ~ scale(standardised_time) * scale(n.threats) + (1|population) + ar1(scale(standardised_time)-1|population), 
                  data = species_single0,
                  ziformula=~n.threats,
                  family="nbinom1",
                  ##the control parameter specifying the optimizer to use:
                  control=glmmTMBControl(optimizer=optim, optArgs=list(method="BFGS")))
## Warning in fitTMB(TMBStruc): Model convergence problem; non-positive-definite
## Hessian matrix. See vignette('troubleshooting')

However, you can see we are still getting the same warning, so we will conclude that in this case the model is probably overparameterised (1). So let's fall back to our previous best model - Zi_nb1_mod. If we thought there was likely to be significant autocorrelation structure then we would need to think about going back and simplifying our original model.

3.2.8 Model diagnostics with DHARMa

The DHARMa package allows us to implement some great diagnostic tools to check our model fits (these are only defined for some distributions at the moment; luckily the zero-inflated negative binomial is one of them):

##load DHARMa
library(DHARMa)

## simulate the residuals from the model
##setting the number of simulations to 1000 (more is better, according to the DHARMa help file)
Zi_nb1_mod_sim <- simulateResiduals(Zi_nb1_mod, n = 1000)

##plot out the residuals
plot(Zi_nb1_mod_sim)

We can interpret the QQ plot as normal - and it looks good! None of the tests are significant (a significant result would indicate that the distribution is a poor fit to the data).

On the right we have a plot of the predictions vs the residuals. Here we are hoping to see solid red lines running straight across the plot, matching the dashed red lines, but as you can see there is some deviation from that and we get a warning that quantile deviations were detected in the residuals. This isn't ideal, but visually it doesn't appear too bad in this case. Ideally we would delve into this some more (see More possible models below). There is an excellent (and long) vignette for DHARMa available here which you can delve into.
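If you are curious what simulateResiduals() is actually doing under the hood, the core idea can be sketched in a few lines of base R. This is a simplified illustration on simulated Poisson data (not our species data, and not DHARMa's exact implementation, which handles many more cases): simulate many new responses from the fitted model, then ask where each observed value falls within its own simulated distribution.

```r
## simulate some count data and fit a Poisson GLM to it
set.seed(1)
x <- runif(200)
y <- rpois(200, lambda = exp(1 + 2 * x))
fit <- glm(y ~ x, family = poisson)

## simulate new response values from the fitted model many times
mu <- fitted(fit)
sims <- replicate(1000, rpois(length(mu), mu))   # a 200 x 1000 matrix

## the scaled residual for each point is where the observation falls
## within its own simulated distribution (0 = below all sims, 1 = above all)
scaled <- sapply(seq_along(y), function(i) {
  below <- mean(sims[i, ] < y[i])
  ties  <- mean(sims[i, ] == y[i])
  below + runif(1, 0, ties)   # randomise within ties, as DHARMa does
})

## for a well-fitting model these residuals are approximately Uniform(0, 1),
## which is why the QQ plot compares them against a uniform distribution
```

The uniformity of these scaled residuals is what the QQ plot and the associated tests are checking.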

3.2.9 Additional DHARMa tests

DHARMa also provides a suite of other test functions which allow us to run diagnostics on the simulated outputs from simulateResiduals():

## test whether there are outliers; in our case not significant so we don't need to worry
testOutliers(Zi_nb1_mod_sim, 
             plot = TRUE)

## 
##  DHARMa bootstrapped outlier test
## 
## data:  Zi_nb1_mod_sim
## outliers at both margin(s) = 0, observations = 600, p-value = 0.9
## alternative hypothesis: two.sided
##  percent confidence interval:
##  0.00000 0.02175
## sample estimates:
## outlier frequency (expected: 0.00338333333333333 ) 
##                                                  0
## tests if the simulated dispersion is equal to the observed dispersion,
## again not significant so no need to worry
testDispersion(Zi_nb1_mod_sim, 
               plot = TRUE)

## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## ratioObsSim = 0.78004, p-value = 0.39
## alternative hypothesis: two.sided
## compares the distribution of expected zeros in the data against the observed zeros
## this is right on the borderline of being significant, suggesting there might be a better structure for
## our zero inflation parameter (remember we used ~n.threats). That might be worth looking into further
testZeroInflation(Zi_nb1_mod_sim, 
                  plot = TRUE)

## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = 0.57281, p-value = 0.05
## alternative hypothesis: two.sided
## see if there is temporal autocorrelation in the residuals
## not significant, so it turns out we didn't need to try to fit the autocorrelation model earlier on!
testTemporalAutocorrelation(Zi_nb1_mod_sim, 
                            plot = TRUE) 
## DHARMa::testTemporalAutocorrelation - no time argument provided, using random times for each data point

## 
##  Durbin-Watson test
## 
## data:  simulationOutput$scaledResiduals ~ 1
## DW = 2.0987, p-value = 0.2261
## alternative hypothesis: true autocorrelation is not 0

So overall it seems like our model is a pretty good fit to the observed data.

3.2.10 What does it mean when a model doesn’t fit properly?

It is worth briefly considering what the implications of a poorly fitting model are, and how much of an issue it is. This really depends on the questions you are interested in and what you want to do with your model.

In our case, we are interested in whether there is a significant effect of the number of threats on how population abundances change through time. We aren't necessarily interested in the exact values of the coefficients (whether the effect of an additional stressor is -0.10224 or -0.10890 doesn't make a big difference to us). We do want to be sure that the patterns we are seeing in the model are reflected in the data, and the fact that they are highly significant and the effect sizes are large means we can be quite confident in this even if the model fit isn't "perfect".

However, if we have marginal effects with small effect sizes then we are likely to be more concerned about how well our model fits to be sure whether the effects we are reporting are in fact observed in the data, rather than a function of a poor model fit.

Likewise, if we are fitting a model to data and then using that model to, say, parameterise a simulation, or to project trends beyond the data we have collected, then we are again going to be much more concerned about how well our model fits, as any errors in the coefficient estimates will be propagated when we use those estimates in other analyses. In this case we might want to fit something more flexible, like a Generalised Additive Model (GAM) or a Generalised Additive Mixed Model (GAMM). These deal well with non-linear effects and so make reliable predictions within the bounds of the data the model was fitted to, but they are difficult to interpret when trying to identify the effects of fixed effects and cannot be used to extrapolate beyond your data. You can see the flexibility of GAMs below (purple line) vs a linear model (yellow line).
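To see that flexibility for yourself, here is a minimal sketch using mgcv (a recommended package that ships with R) on simulated data with a deliberately non-linear signal; the data and settings here are invented purely for illustration:

```r
##mgcv ships with R as a recommended package
library(mgcv)

## simulate data with a strongly non-linear (sinusoidal) signal
set.seed(2)
x <- seq(0, 2 * pi, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)

lin_mod <- lm(y ~ x)         # a straight-line fit (like the yellow line)
gam_mod <- gam(y ~ s(x))     # a flexible smooth fit (like the purple line)

## the smooth captures the curvature, so it should fit far better
AIC(lin_mod)
AIC(gam_mod)
```

Plotting both fits over the points (e.g. with plot(gam_mod) or predict() and ggplot()) makes the difference very obvious.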

In our case, we will continue with the best model we identified above and look at interpreting and plotting the outputs.

3.2.11 Final visual checks

As you can see it has taken a long time to get to here, and if your data is very complex, or very non-normal, this process takes even longer. I like to do one final easy check of how well our model is working, and this is simply to compare the model predictions to the observed values. We will go back to our old friend the predict() function for this:

## add in the predicted values from the model:
species_single0$predicted <- predict(Zi_nb1_mod, 
                                     newdata = species_single0, 
                                     type = "response")

##plot the predicted against the observed
ggplot(species_single0, aes(x = abundance, 
                            y = predicted)) + 
  geom_point(col="grey") + 
  geom_abline(slope = 1) +
  theme_minimal() +
  xlab("Observed") +
  ylab("Predicted")

The points should fall close to the 1:1 line, and you can see our model does a pretty good job. It both underpredicts and overpredicts the values in some places, but we expect that (we won't ever have a model which fits perfectly; after all, the whole point of a model is to simplify the patterns in data so we can interpret them) and there aren't any clear patterns (like only underpredicting at high observed values).

Phew! All that effort was worth it!

There are of course a lot more ways to go about visualising models, and Ben Bolker gives a much more detailed deep dive into diagnostic plotting here, which I highly recommend you look through.

3.2.12 More possible models

We said before that we might want to consider a more complex zero-inflation structure than the one we currently have. We may also want to think about explicitly modelling a dispersion parameter (although our tests above don't highlight this as a particular issue, it's worth knowing it is possible); for example, n.threats might be a predictor of increased dispersion in the data. glmmTMB allows us to do this, but we aren't going to dig into that today. However, if we weren't confident about the model fit (low pseudo-R2, poor predicted vs observed plots, etc.), or wanted to check some other model structures, that would be a good place to start.

3.2.13 What does our model ACTUALLY SAY?

So finally, what does our model say?


Task

Print the summary of the Zi_nb1_mod.

##summarise the model 
summary(Zi_nb1_mod)
##  Family: nbinom1  ( log )
## Formula:          
## abundance ~ scale(standardised_time) * scale(n.threats) + (1 |  
##     species/population)
## Zero inflation:             ~n.threats
## Data: species_single0
## 
##      AIC      BIC   logLik deviance df.resid 
##   6041.6   6081.2  -3011.8   6023.6      591 
## 
## Random effects:
## 
## Conditional model:
##  Groups             Name        Variance  Std.Dev.
##  population:species (Intercept) 5.013e-01 0.707994
##  species            (Intercept) 1.095e-07 0.000331
## Number of obs: 600, groups:  population:species, 59; species, 51
## 
## Overdispersion parameter for nbinom1 family (): 15.2 
## 
## Conditional model:
##                                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)                                4.43446    0.09774   45.37  < 2e-16
## scale(standardised_time)                  -0.49215    0.02231  -22.06  < 2e-16
## scale(n.threats)                          -0.72013    0.09532   -7.55 4.21e-14
## scale(standardised_time):scale(n.threats) -0.50318    0.02139  -23.52  < 2e-16
##                                              
## (Intercept)                               ***
## scale(standardised_time)                  ***
## scale(n.threats)                          ***
## scale(standardised_time):scale(n.threats) ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Zero-inflation model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -8.4707     1.3971  -6.063 1.34e-09 ***
## n.threats     1.8103     0.5485   3.300 0.000966 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Well, it says:

  1. on average, as time increases abundances decline
  2. as the number of threats increases, the average population size decreases
  3. the interaction indicates that species with more threats decline faster through time than those with fewer threats

3.2.14 Plotting model outputs

There are a host of ways to plot model outputs, but this one is adapted from the glmmTMB paper appendix:

## make a new data set containing the variables in the model;
## we will then use this to generate predicted values from the model for 
## various combinations of number of threats, standardised time, species,
## and population.
## we can use the unique() function across the columns of our data.frame
## to retrieve every unique combination of:
## n.threats, standardised_time, species, and population
new_data <-  unique(species_single0[,c("n.threats",
                                       "standardised_time", 
                                       "species", 
                                        "population")])

##scale the relevant columns (remember our model is expecting scaled data)
new_data$n.threats<-scale(new_data$n.threats)
new_data$standardised_time<-scale(new_data$standardised_time)

##generate predictions from the fixed effects only (the random effects are set to zero)
X_cond <- model.matrix(lme4::nobars(formula(Zi_nb1_mod)[-2]), new_data)
beta_cond <- fixef(Zi_nb1_mod)$cond
pred_cond <- X_cond %*% beta_cond
ziformula <- Zi_nb1_mod$modelInfo$allForm$ziformula
X_zi <- model.matrix(lme4::nobars(ziformula), new_data)
beta_zi <- fixef(Zi_nb1_mod)$zi
pred_zi <- X_zi %*% beta_zi

##transform the point estimates to the response scale and combine them
##(the conditional part is on the log scale, the zero-inflation part on the logit scale)
pred_ucount = exp(pred_cond)*(1-plogis(pred_zi))
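It is worth pausing on that last line, because it is where the two halves of the zero-inflated model come together. A tiny numeric sketch with made-up prediction values shows the logic:

```r
## hypothetical predictions for a single observation:
pred_cond <- 2.0            # conditional prediction, on the log scale
pred_zi   <- -1.0           # zero-inflation prediction, on the logit scale

mu     <- exp(pred_cond)    # expected count, given it is not a structural zero
p_zero <- plogis(pred_zi)   # probability of being a structural zero

## the unconditional expected count weights the conditional mean by the
## probability of NOT being a structural zero
expected_count <- mu * (1 - p_zero)
```

So exp() undoes the log link, plogis() undoes the logit link, and multiplying by (1 - zero probability) gives the overall expected count.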

##load the MASS library
library(MASS)

##set the random number generator seed
set.seed(101)

##use posterior predictive simulations to generate upper and lower confidence intervals
## and median predicted counts,
##ignoring variation in the random effects

##conditional
pred_condpar_psim = mvrnorm(1000,mu=beta_cond,Sigma=vcov(Zi_nb1_mod)$cond)
pred_cond_psim = X_cond %*% t(pred_condpar_psim)

##zero inflation parameter
pred_zipar_psim = mvrnorm(1000,mu=beta_zi,Sigma=vcov(Zi_nb1_mod)$zi)
pred_zi_psim = X_zi %*% t(pred_zipar_psim)

##transform them
pred_ucount_psim = exp(pred_cond_psim)*(1-plogis(pred_zi_psim))

##calculate 95% CIs
ci_ucount = t(apply(pred_ucount_psim,1,quantile,c(0.025,0.975)))
ci_ucount = data.frame(ci_ucount)

##rename
names(ci_ucount) = c("ucount_low","ucount_high")

##put into a data frame
pred_ucount = data.frame(new_data, 
                         pred_ucount, 
                         ci_ucount)

##we need to reverse the scaling of our predictor variables for our plots to make sense
##the scale() function stores attributes of the scaling in the vectors of scaled data 
## try running new_data$n.threats and looking at the bottom values
##write a function to do this:
unscale <- function(x){
  x * attr(x, 'scaled:scale') + attr(x, 'scaled:center')
}
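You can check that unscale() really does round-trip correctly on a small toy vector (the values here are arbitrary):

```r
## scale() centres each value and divides by the standard deviation, storing
## the centre and sd as attributes on the result
v <- c(1, 5, 9)
s <- scale(v)
attr(s, "scaled:center")   # the mean of v
attr(s, "scaled:scale")    # the sd of v

## reverse the transformation using those stored attributes
unscale <- function(x) {
  x * attr(x, "scaled:scale") + attr(x, "scaled:center")
}
back <- unscale(s)

## back now holds the original values of v
```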

##unscale the variables
pred_ucount$n.threats_unscaled <- unscale(pred_ucount$n.threats)
pred_ucount$standardised_time_unscaled <- unscale(pred_ucount$standardised_time)

##load the viridis package (colourblind-friendly palettes)
library(viridis)

##plot out the predicted median values for abundance
## in response to time (x-axis)
##and grouped by the number of threats
ggplot(pred_ucount, aes(x = standardised_time_unscaled, 
                        y = pred_ucount, 
                        group = n.threats_unscaled, 
                        col = n.threats_unscaled))+ 
  ##median lines for each number of threats
  geom_line() +
  ##add in a geom_ribbon to show the 95% CI
  geom_ribbon(aes(ymin = ucount_low,
                  ymax = ucount_high), 
              alpha = 0.1, 
              col = "grey", 
              linetype=0) +
  ##minimal theme
  theme_minimal() +
  ##set x and y axes labels
  ylab("Predicted\nabundance") + xlab("Time\n(weeks)") +
  ##viridis colour palette for continuous data
  scale_colour_viridis_c() +
  ##move legend to the top
  theme(legend.position = "top") +
  ##rename the legend
  labs(colour="Number of threats")

Brilliant! We have a clear visualisation of the effects of multiple threats on the dynamics of populations through time, and we can say with some certainty that the more threats a population faces, the faster it is likely to be declining.

4 Key Concepts

Let’s quickly summarise what we have learned today:

  • What is a GLM (and a GLMM)
  • What are error distributions
  • What are link functions
  • What do we do when we don't have normally distributed data
  • What error distributions should we consider for count data
  • How to fit a GLM/M
  • How to decide on nested/crossed random effects
  • How to account for temporal autocorrelation
  • How to visualise residuals (difference between predicted and observed)
  • How to tell if a model is fitting our data well
  • How to explore alternative models
  • How to compare models
  • How to interpret model summaries
  • How to plot the results of GLMMs

Phew! That's a lot to cover in one week. Don't worry if you didn't manage to work through everything or found some parts difficult - you can always come back to this and refer to it at any time.

5 Functions covered today

| Function/operator | Call |
|---|---|
| Apply a smoothing visualisation to a ggplot() | geom_smooth() |
| Calculate the difference between dates | difftime() |
| Run a GLM | glm() |
| Return predicted values from a model | predict() |
| Return the residuals of a GLM | resid() |
| Make a QQ plot | qqnorm() |
| Add a QQ line to a QQ plot | qqline() |
| Compare models using AIC | AIC() |
| Summarise a model | summary() |
| Order the items in a column | order() |
| Fit a GLMM with TMB | glmmTMB() |
| Create scaled residuals from a GLMM | simulateResiduals() |
| Apply a function to grouped data in the tidyverse | group_map() |
| Fit a distribution to observations | fitdist() |
| Generate goodness-of-fit statistics for a fitted distribution | gofstat() |
| Calculate pseudo R-squared for GLMMs | r.squaredGLMM() |
| Update a model with a new argument | update() |
| Compare models with AIC and give more information | anova() |
| Test for significant outliers in simulateResiduals() output | testOutliers() |
| Test for significant dispersion in simulateResiduals() output | testDispersion() |
| Test for significant zero inflation in simulateResiduals() output | testZeroInflation() |
| Test for significant temporal autocorrelation in the residuals from simulateResiduals() | testTemporalAutocorrelation() |
| Return the fixed effects of a GLMM | fixef() |
| Compute the exponential | exp() |
| Plot a ribbon geom | geom_ribbon() |
| Apply a continuous viridis colour scheme | scale_colour_viridis_c() |

8 Homework

So this week’s homework will build on our delve into fitting GLMs to data. In your cloned Bioinformatics_data repository in the Workshop 5 folder you will find four different data sets. You will need to analyse these and for each identify:

  1. what the error distribution of the data is
  2. which of the predictor (x) variables have a significant effect on the y variable

You can do these in a GLM framework as there aren’t any random effects in the data.

Answers to be revealed at the start of next week!